Read Graphical Data Analysis with R, Ch. 4, 5

Grading is based on both your graphs and your verbal explanations. Follow all best practices discussed in class. Data manipulation should not be hard-coded; that is, your scripts should be written to work with new data.

1. useR! 2016 survey

library(forwards)
library(ggplot2)
library(dplyr)

[18 points]

Data: useR2016 dataset in the forwards package (available on CRAN)

For parts (a) and (b):

  (a) Create a horizontal bar chart of the responses to Q20.
ggplot(subset(useR2016, !is.na(Q20)), aes(Q20)) + geom_bar() + coord_flip()

  (b) Create a vertical bar chart of the responses to Q11.
ggplot(subset(useR2016, !is.na(Q11)), aes(Q11)) + geom_bar()

  (c) Create a horizontal stacked bar chart showing the proportion of respondents for each level of Q11 who are over 35 vs. 35 or under. Use a descriptive title.
d <- useR2016[!is.na(useR2016$Q11) & !is.na(useR2016$Q3), ] %>% 
  group_by(Q11, Q3) %>% 
  summarise(count=n()) %>% 
  # summarise() drops the Q3 grouping, so sum(count) is the total for each Q11 level
  mutate(proportion=count/sum(count))
ggplot(d, aes(x=Q11, y=proportion, fill=Q3)) + geom_bar(stat="identity") + coord_flip() +
  ggtitle("Age group (Q3) as a proportion of each Q11 response")
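
The proportions in part (c) depend on how dplyr peels off grouping levels. The sketch below uses a made-up data frame (the columns `level` and `age` are hypothetical stand-ins for Q11 and Q3). After `summarise()` the result is still grouped by the first variable, so `sum(count)` is taken within each level and the proportions in every bar add up to 1.

```r
library(dplyr)

# Toy data; `level` and `age` are hypothetical stand-ins for Q11 and Q3
toy <- data.frame(
  level = c("a", "a", "a", "b", "b"),
  age   = c("> 35", "> 35", "35 or under", "> 35", "35 or under")
)

props <- toy %>%
  group_by(level, age) %>%
  # .groups = "drop_last" is the default behaviour, spelled out explicitly
  summarise(count = n(), .groups = "drop_last") %>%
  # still grouped by `level`, so sum(count) is the size of each level
  mutate(proportion = count / sum(count)) %>%
  ungroup()

props
```

Grouping in the other order, `group_by(age, level)`, would instead give the share of each level within an age group, which is what part (d) needs.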

  (d) Create a horizontal stacked bar chart showing the proportional breakdown of Q11 for each level of Q3, faceted on Q2. Use a descriptive title.
# Drop missing Q2 as well, so that no NA facet appears
d <- useR2016[!is.na(useR2016$Q2) & !is.na(useR2016$Q3) & !is.na(useR2016$Q11), ] %>% 
  group_by(Q2, Q3, Q11) %>% 
  summarise(count=n()) %>% 
  mutate(proportion=count/sum(count))
ggplot(d, aes(x=Q3, y=proportion, fill=Q11)) + geom_bar(stat="identity") + facet_grid(Q2~.) + coord_flip() +
  ggtitle("Q11 responses as a proportion of each age group (Q3), by Q2")

  (e) For the next part, we will need to be able to add line breaks (\n) to long tick mark labels. Write a function that takes a character string and a desired approximate line length in number of characters and substitutes a line break for the first space after every multiple of the specified line length.
add_line_breaks <- function(string, num){
  # Match `num` characters plus the rest of the current word, then replace
  # the space that follows with a line break; counting restarts after each break
  pattern <- paste0('(.{', num, '}\\S*)\\s')
  gsub(pattern, '\\1\n', string, perl = TRUE)
}
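
As a cross-check on label wrapping, base R's `strwrap()` offers a related but different behaviour: it breaks a line before the target width is reached rather than at the first space after it, so it is not a drop-in answer to part (e). The helper name `wrap_label` and the sample label below are made up for illustration.

```r
# Base R alternative: strwrap() splits text into lines shorter than `width`,
# and paste() joins them back together with line breaks
wrap_label <- function(string, width) {
  paste(strwrap(string, width = width), collapse = "\n")
}

label <- "a fairly long tick mark label that needs wrapping"
wrapped <- wrap_label(label, 15)
cat(wrapped, "\n")
```

Note that `wrap_label` handles a single string; for a vector of labels it would need to be applied element by element, for example with `vapply()`.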
  (f) Create a horizontal bar chart that shows the percentage of positive responses for Q13 - Q13_F. Use your function from part (e) to add line breaks to the responses. Your graph should have one bar each for Q13 - Q13_F.
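
A possible shape for part (f), sketched on toy data rather than the real survey. It assumes that each Q13 column holds the response text when the respondent selected it and `NA` otherwise; that assumption, the made-up response texts, and the helper `break_lines` (a stand-in for the part (e) function) are not taken from the dataset and should be checked against it.

```r
library(ggplot2)

# Toy stand-in for the Q13 - Q13_F columns (two columns, made-up text).
# ASSUMPTION: a cell holds the response text when selected and NA otherwise.
toy <- data.frame(
  Q13   = c("made-up first response that is fairly long", NA,
            "made-up first response that is fairly long",
            "made-up first response that is fairly long"),
  Q13_B = c(NA, "made-up second response, also fairly long", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the part (e) function
break_lines <- function(string, num) {
  gsub(paste0("(.{", num, "}\\S*)\\s"), "\\1\n", string, perl = TRUE)
}

pct <- data.frame(
  # the only non-missing value in a column is its response text
  response = vapply(toy, function(x) as.character(na.omit(x))[1], character(1)),
  # share of rows with a non-missing (positive) response
  percent  = vapply(toy, function(x) 100 * mean(!is.na(x)), numeric(1))
)
pct$label <- break_lines(pct$response, 15)

# One bar per question, longest bar at the top after the flip
ggplot(pct, aes(x = reorder(label, percent), y = percent)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "% of respondents",
       title = "Positive responses (toy data)")
```

With the real data, `toy` would be replaced by the Q13 columns, for example `dplyr::select(useR2016, Q13:Q13_F)` if those columns are adjacent.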

2. Rotten Tomatoes

[18 points]

library(robotstxt)
library(rvest)
theme_dotplot<-theme_bw(16)+
  theme(axis.text.y=element_text(size=rel(.75)),
        axis.ticks.y=element_blank(),
        axis.title.x=element_text(size=rel(.75)),
        panel.grid.major.x=element_blank(),
        panel.grid.major.y=element_line(size=0.5),
        panel.grid.minor.x=element_blank())

To get the data for this problem, we’ll use the robotstxt package to check that it’s ok to scrape data from Rotten Tomatoes and then use the rvest package to get data from the web site.

  (a) Use the paths_allowed() function from robotstxt to make sure it’s ok to scrape https://www.rottentomatoes.com/browse/box-office/. Then use rvest functions to find relative links to individual movies listed on this page. Finally, paste the base URL to each to create a character vector of URLs.

Display the first six lines of the vector.

getUrls <- function(baseUrl)
{
  # Stop early if robots.txt does not allow scraping this page
  if(!paths_allowed(baseUrl)) stop("Scraping not permitted: ", baseUrl)
  # Read the page that was passed in, not a hard-coded URL
  read_html(baseUrl) %>% 
    html_nodes("[target='_top']") %>% 
    html_attr("href") %>% 
    paste0("https://www.rottentomatoes.com", .)
}

linkData <- getUrls("https://www.rottentomatoes.com/browse/box-office/")
print(linkData[1:6])
## [1] "https://www.rottentomatoes.com/m/abominable/"      
## [2] "https://www.rottentomatoes.com/m/downton_abbey/"   
## [3] "https://www.rottentomatoes.com/m/hustlers_2019/"   
## [4] "https://www.rottentomatoes.com/m/it_chapter_two/"  
## [5] "https://www.rottentomatoes.com/m/ad_astra/"        
## [6] "https://www.rottentomatoes.com/m/rambo_last_blood/"
  (b) Write a function to read the content of one page and pull out the title, tomatometer score and audience score of the film. Then iterate over the vector of all movies using do.call() / rbind() / lapply() or dplyr::bind_rows() / purrr::map() to create a three column data frame (or tibble).

Display the first six lines of your data frame.

(Results will vary depending on when you pull the data.)

For help, see this SO post: https://stackoverflow.com/questions/36709184/build-data-frame-from-multiple-rvest-elements

library(stringr)
getDataFromLink <- function(url){
  # Return NULL for pages that fail to load; bind_rows() skips NULL elements
  web <- tryCatch(read_html(url), error = function(e) NULL)
  if(is.null(web)) return(NULL)

  title <- web %>% 
    html_nodes("[class='mop-ratings-wrap__title mop-ratings-wrap__title--top']") %>% 
    html_text()
  # One "NN%" string per score node: tomatometer first, then audience
  score <- web %>% 
    html_nodes("[class='mop-ratings-wrap__percentage']") %>% 
    html_text() %>% 
    str_extract("[0-9]+%")

  # Indexing past the end gives NA, so missing scores become NA
  # instead of the string "NA"
  tibble(title = title[1], tomatometer = score[1], audience = score[2])
} 

webData <- bind_rows(lapply(linkData, getDataFromLink))
head(webData)
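
The base R route mentioned in the question, do.call() / rbind() / lapply(), follows the same pattern. A minimal sketch with no web access: the parser `parse_one` and the input strings are made up and stand in for a function that reads one film page.

```r
# Hypothetical parser: turns one "title|tomatometer|audience" string into a
# one-row data frame (stands in for a function that reads one web page)
parse_one <- function(x) {
  parts <- strsplit(x, "|", fixed = TRUE)[[1]]
  data.frame(title = parts[1], tomatometer = parts[2], audience = parts[3],
             stringsAsFactors = FALSE)
}

pages <- c("Film A|90%|70%", "Film B|40%|55%", "Film C|75%|88%")  # made-up

# lapply() returns a list of one-row data frames; do.call(rbind, ...) stacks them
films <- do.call(rbind, lapply(pages, parse_one))
films
```

Unlike `bind_rows()`, `rbind()` requires every piece to have the same columns, so the parser should always return all three.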

Write your data to file so you don’t need to scrape the site each time you need to access it.
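
One way to cache the result, sketched with a made-up data frame in place of the scraped one. The file name and the use of `tempdir()` are assumptions for the example; a path inside the project directory would serve the same purpose.

```r
# Made-up stand-in for the scraped data frame
films <- data.frame(
  title = c("Film A", "Film B"),
  tomatometer = c("90%", "40%"),
  audience = c("70%", NA),
  stringsAsFactors = FALSE
)

# Hypothetical cache location
cache_file <- file.path(tempdir(), "rotten_tomatoes.csv")

# Write only when no cached copy exists; in real use the scraping
# call would sit inside this branch
if (!file.exists(cache_file)) {
  write.csv(films, cache_file, row.names = FALSE)
}

# Later sessions read the file instead of hitting the site again
cached <- read.csv(cache_file, stringsAsFactors = FALSE)
cached
```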

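  (c) Create a Cleveland dot plot of tomatometer scores.
# Scores are scraped as strings such as "85%"; convert to numbers and sort the titles by score
ggplot(webData, aes(as.numeric(sub("%", "", tomatometer)), reorder(title, as.numeric(sub("%", "", tomatometer))))) +
  geom_point() + labs(x = "tomatometer score (%)", y = NULL) + theme_dotplot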

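  (d) Create a Cleveland dot plot of tomatometer and audience scores on the same graph, one color for each. Sort by audience score.
# Convert the "NN%" strings to numbers and order the titles by audience score
ggplot(webData, aes(y = reorder(title, as.numeric(sub("%", "", audience))))) +
  geom_point(aes(x = as.numeric(sub("%", "", tomatometer)), colour = "tomatometer score")) +
  geom_point(aes(x = as.numeric(sub("%", "", audience)), colour = "audience score")) +
  labs(x = "score (%)", y = NULL, colour = NULL) + theme_dotplot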
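
An alternative layout for the two-score dot plot is to reshape the data to long format and map colour to the score type, which needs only one `geom_point()` layer. The sketch below uses made-up films and assumes the scores arrive as strings such as "85%".

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up data in the same shape as the scraped table
films <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  tomatometer = c("90%", "40%", "75%"),
  audience = c("70%", "55%", "88%"),
  stringsAsFactors = FALSE
)

long <- films %>%
  # strip the percent sign so scores are numbers, not labels
  mutate(across(c(tomatometer, audience), ~ as.numeric(sub("%", "", .x)))) %>%
  # fix the title order by audience score before reshaping
  mutate(title = reorder(title, audience)) %>%
  pivot_longer(c(tomatometer, audience), names_to = "source", values_to = "score")

ggplot(long, aes(x = score, y = title, colour = source)) +
  geom_point() +
  labs(x = "score (%)", y = NULL)
```

Setting the factor order before `pivot_longer()` matters: once the scores share one column, reordering by it would mix the two score types.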

  (e) Run your code again for the weekend of July 5 - July 7, 2019. Use plotly to create a scatterplot of audience score vs. tomatometer score with the ability to hover over the point to see the film title.
library(plotly)
f <- list(
  family = "Courier New, monospace",
  size = 18,
  color = "#7f7f7f"
)
x <- list(
  title = "tomatometer score",
  titlefont = f
)
y <- list(
  title = "audience score",
  titlefont = f
)

webDataJulyUrls <- getUrls("https://www.rottentomatoes.com/browse/box-office/?rank_id=11&country=us")
webDataJuly <- bind_rows(lapply(webDataJulyUrls, getDataFromLink))
plot_ly(webDataJuly, x = ~as.numeric(sub("%", "", tomatometer)), y = ~as.numeric(sub("%", "", audience)), text=~title) %>%
  layout(xaxis = x, yaxis = y) %>%
  add_markers()

3. Weather

[14 points]

library(nycflights13)

Data: weather dataset in nycflights13 package (available on CRAN)

For parts (a) - (d) draw four plots of wind_dir vs. humid as indicated. For all, adjust parameters to the levels that provide the best views of the data.

# "wind_dir vs. humid" read as y vs. x, matching the audience vs. tomatometer plot above
g <- ggplot(weather, aes(x=humid, y=wind_dir))
  (a) Points with alpha blending
g + geom_point(alpha=0.2, stroke=0) 

  (b) Points with alpha blending + density estimate contour lines
g + geom_point(alpha=0.2, stroke=0) + geom_density_2d()

  (c) Hexagonal heatmap of bin counts
g + geom_hex()

  (d) Square heatmap of bin counts
g + geom_bin2d()

  (e) Describe noteworthy features of the data, using the “Movie ratings” example on page 82 (last page of Section 5.3) as a guide.

  (f) Draw a scatterplot of humid vs. temp. Why does the plot have diagonal lines?

# humid vs. temp: humid on the y-axis, temp on the x-axis
ggplot(weather, aes(x=temp, y=humid)) + geom_point()
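
One plausible explanation for the diagonal bands, offered as a sketch rather than a verified account of how the dataset was built: if relative humidity is derived from temperature and dew point, and both are recorded only to whole degrees, then every distinct dew point value traces its own curve through the plot. The example below applies the Magnus approximation (constants 17.625 and 243.04 °C) to a made-up grid of whole-degree Celsius readings; the function name `rel_humidity` and the grid are assumptions.

```r
library(ggplot2)

# Magnus approximation: relative humidity (%) from temperature and
# dew point, both in degrees Celsius
rel_humidity <- function(temp_c, dewp_c) {
  a <- 17.625
  b <- 243.04
  100 * exp(a * dewp_c / (b + dewp_c)) / exp(a * temp_c / (b + temp_c))
}

# Made-up readings on a whole-degree grid, keeping dew point <= temperature
grid <- expand.grid(temp_c = 0:30, dewp_c = 0:30)
grid <- grid[grid$dewp_c <= grid$temp_c, ]
grid$humid <- rel_humidity(grid$temp_c, grid$dewp_c)

# Each dew point value traces one curve across the plot
ggplot(grid, aes(x = temp_c, y = humid)) +
  geom_point(size = 0.5)
```

Comparing this picture with the real scatterplot, for instance by colouring the weather points by `dewp`, would show whether the bands line up with dew point values.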

  (g) Draw a scatterplot matrix of the continuous variables in the weather dataset. Which pairs of variables are strongly positively associated and which are strongly negatively associated?
# Select the continuous columns by name rather than by position
pairs(select(weather, temp:visib))

(h) Color the points by origin. Do any new patterns emerge?

pairs(select(weather, temp:visib), col=factor(weather$origin))